Introduction

In the two previous notebooks, we trained a sentiment-analysis model and built an API to query live tweets, analyse them and store them in a NoSQL database. The pipeline ran during the France–Argentina match of 30/06/2018 at the 2018 World Cup. In this notebook, we will dig a bit deeper into those data and analyse several aspects of them.

In [1]:
import numpy as np
import json
import datetime
import tqdm

import seaborn as sns

from collections import Counter
from nltk.corpus import stopwords

import pandas as pd
from bson import json_util

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as md
from matplotlib import dates

from wordcloud import WordCloud

import pymongo
from pymongo import MongoClient

Creating the DataFrame

With so little data and not being very comfortable with NoSQL databases, we will first load everything into a DataFrame.

In [2]:
client = MongoClient('localhost', 27017)

db = client['Twitter_db']
collection_clean = db['tweets_clean']
In [3]:
print(json.dumps(collection_clean.find_one(), indent=4, default=json_util.default))
{
    "_id": 1013053260956098561,
    "text": "Come on France. #worldcup2018 #FRAARG",
    "time": {
        "$date": 1530372893000
    },
    "hashtags": [
        "worldcup2018",
        "FRAARG"
    ],
    "sentiment": 0.6849161215932269,
    "tokens": [
        "come",
        "on",
        "franc"
    ]
}

This is what is stored for every tweet: the complete text, tokens produced with nltk's TweetTokenizer and a stemmer, all hashtags and the predicted sentiment. We can now create an empty DataFrame and fill it with the records from MongoDB.
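As a reminder, the tokens were produced in the previous notebook; a minimal sketch of that kind of pipeline, assuming nltk's TweetTokenizer and PorterStemmer (the exact stemmer used there may differ):

```python
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer

tokenizer = TweetTokenizer(preserve_case=False)
stemmer = PorterStemmer()

def tokenize(text):
    # keep only alphabetic tokens (drops punctuation and hashtags), then stem
    return [stemmer.stem(tok) for tok in tokenizer.tokenize(text) if tok.isalpha()]

print(tokenize("Come on France. #worldcup2018 #FRAARG"))  # ['come', 'on', 'franc']
```

This reproduces the `["come", "on", "franc"]` tokens stored for the sample tweet above.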

In [4]:
nb = collection_clean.count_documents({})
df = pd.DataFrame(index=range(nb), columns=['ID','Text', "Time",'Hashtags','Sentiment','tokens'])

for i, record in enumerate(collection_clean.find()):
    obj = {
        'ID' : record["_id"],
        'Text' : record["text"],
        'Time' : record["time"],
        'Hashtags' : "-".join(record["hashtags"]),
        'Sentiment' : record["sentiment"],
        'tokens' : "-".join(record["tokens"])
    }

    df.iloc[i, :] = obj
    
df = df.set_index("ID")
In [5]:
client.close()
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 112700 entries, 1013053260956098561 to 1013096849245290496
Data columns (total 5 columns):
Text         112700 non-null object
Time         112700 non-null object
Hashtags     112700 non-null object
Sentiment    112700 non-null object
tokens       112700 non-null object
dtypes: object(5)
memory usage: 5.2+ MB
In [7]:
df.to_csv("F:/Twitter_data/dataset/fra_arg_full.csv", encoding="utf-8", sep="%")

We have now converted our 112,700 tweets to a DataFrame and saved it, so we do not have to repeat those steps later.
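Filling the DataFrame row by row with `.iloc` is fairly slow for 112,700 records; a sketch of a faster alternative, building a list of dicts first (the sample record below is hypothetical, with the same fields as the MongoDB documents):

```python
import pandas as pd

# hypothetical records, as returned by collection_clean.find()
records = [
    {"_id": 1, "text": "Come on France", "time": "2018-06-30 15:34:53",
     "hashtags": ["FRAARG"], "sentiment": 0.68, "tokens": ["come", "on", "franc"]},
]

rows = [{
    "ID": r["_id"],
    "Text": r["text"],
    "Time": r["time"],
    "Hashtags": "-".join(r["hashtags"]),
    "Sentiment": r["sentiment"],
    "tokens": "-".join(r["tokens"]),
} for r in records]

# one constructor call instead of 112,700 .iloc assignments
df = pd.DataFrame(rows).set_index("ID")
print(df.shape)  # (1, 5)
```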

Exploration

Now we will explore the content from multiple angles, but first let's prepare the required data.

In [8]:
df = pd.read_csv("F:/Twitter_data/dataset/fra_arg_full.csv", encoding="utf-8", sep="%", index_col=0)
df.tokens = df.tokens.fillna("N/A")
df.Hashtags = df.Hashtags.fillna("N/A")
df['Time'] = pd.to_datetime(df['Time'])
In [9]:
df.head(5)
Out[9]:
Text Time Hashtags Sentiment tokens
ID
1013053260956098561 Come on France. #worldcup2018 #FRAARG 2018-06-30 15:34:53 worldcup2018-FRAARG 0.684916 come-on-franc
1013053268950315008 This argentina bench actually has more talent ... 2018-06-30 15:34:55 N/A 0.832588 this-argentina-bench-actual-has-more-talent-th...
1013053270313619456 Let them know Messi, let them know. #WorldCup1... 2018-06-30 15:34:55 WorldCup18-FRAARG 0.824827 let-them-know-messi-let-them-know
1013053272758759425 #FRAARG \r\n\r\nI would be surprised if #Arg b... 2018-06-30 15:34:56 FRAARG-Arg-FRA 0.466349 would-be-surpris-if-beat-that-would-be-an-upse...
1013053274872799232 Today 😍\r\n\r\n#FRAARG #URUPOR https://t.co/7t... 2018-06-30 15:34:57 FRAARG-URUPOR 0.541423 today

First, we can look at all tokens and their frequencies. To do so, we will remove the standard English stopwords as well as the two artificial tokens created during preprocessing, "three_dot" and "exc_mark".

Most Common Words

In [10]:
stopWords = stopwords.words('english')
stopWords += ['three_dot', 'exc_mark', "N/A"]
results = Counter()
df['tokens'].str.split("-").apply(results.update)

for word in stopWords:
    if word in results:
        del results[word]
In [11]:
results.most_common(50)
Out[11]:
[('argentina', 17392),
 ('franc', 15465),
 ('game', 13305),
 ('goal', 11682),
 ('mbapp', 11275),
 ('messi', 11196),
 ('match', 6240),
 ('world', 5906),
 ('go', 5467),
 ('2', 4832),
 ('team', 4831),
 ('di', 4772),
 ('maria', 4707),
 ('like', 4573),
 ('play', 4510),
 ('cup', 4272),
 ('1', 3930),
 ('one', 3660),
 ('win', 3653),
 ('vote', 3537),
 ('score', 3533),
 ('get', 3377),
 ('time', 3238),
 ('good', 3075),
 ('player', 3008),
 ('watch', 2932),
 ('best', 2735),
 ('footbal', 2701),
 ('see', 2693),
 ('look', 2693),
 ('pavard', 2642),
 ('4', 2606),
 ('come', 2592),
 ('today', 2496),
 ('penalti', 2319),
 ('great', 2314),
 ('3', 2252),
 ('fuck', 2217),
 ('back', 2193),
 ('tap', 2094),
 ('well', 2083),
 ('vs', 2049),
 ('maradona', 2036),
 ('half', 1947),
 ('far', 1917),
 ('make', 1903),
 ('live', 1882),
 ('strike', 1868),
 ('take', 1858),
 ('french', 1833)]

If we explore the result (I did the check on the top 1000 but display only the top 50 for readability), we can see that several players appear under multiple spellings. The worst case is Mbappe, whose name is written in five different ways.

In [12]:
print("mbapp:", results["mbapp"])
print("mbappe:", results["mbappe"])
print("mbappé:", results["mbappé"])
print("bappe:", results["bappe"])
print("bappé:", results["bappé"])
print("kilian:", results["kilian"])
print("killian:", results["killian"])
mbapp: 11275
mbappe: 322
mbappé: 755
bappe: 11
bappé: 4
kilian: 7
killian: 26

Because of all the cleanup this requires, we will come back to this step a bit later.
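One way to handle those variant spellings (the approach used in the player-ranking section below) is to fold them into a canonical counter; a minimal sketch using the Mbappe counts printed above:

```python
from collections import Counter

# variant spellings and their counts, taken from the cell above
results = Counter({"mbapp": 11275, "mbappe": 322, "mbappé": 755,
                   "bappe": 11, "bappé": 4})

# map each canonical name to all its observed spellings
variants = {"mbappe": ["mbapp", "mbappe", "mbappé", "bappe", "bappé"]}

normalized = Counter()
for canonical, forms in variants.items():
    normalized[canonical] = sum(results[f] for f in forms)

print(normalized["mbappe"])  # 12367
```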

Most Common Hashtags

We can do the same with hashtags, but in this case no cleaning is needed and we can visualise the "balance" with a word cloud.

In [13]:
results_tag = Counter()
df['Hashtags'].str.split("-").apply(results_tag.update)

for word in stopWords:
    if word in results_tag:
        del results_tag[word]

wordcloud = WordCloud().generate_from_frequencies(results_tag)

plt.figure(figsize=(20,12))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Analysis of tweet frequencies during the match

We can group the DataFrame by minute and, for now, just count the number of tweets. We can then plot the result and overlay some specific events (goals) to see their effect.

In [14]:
agg = df.Time.groupby([df.Time.dt.hour, df.Time.dt.minute]).agg(["min", "count"])
x = agg["min"].values
y = agg["count"].values

agg.head()
Out[14]:
min count
Time Time
15 34 2018-06-30 15:34:53 8
35 2018-06-30 15:35:00 64
36 2018-06-30 15:36:01 71
37 2018-06-30 15:37:00 65
38 2018-06-30 15:38:00 78
In [15]:
start_first = datetime.datetime(2018, 6, 30, 16, 0)
end_first = datetime.datetime(2018, 6, 30, 16, 47)
start_second = datetime.datetime(2018, 6, 30, 17, 2)
end_second = datetime.datetime(2018, 6, 30, 17, 51)

goal_time = [
    datetime.datetime(2018, 6, 30, 16, 13), 
    datetime.datetime(2018, 6, 30, 16, 40),
    datetime.datetime(2018, 6, 30, 17, 5),
    datetime.datetime(2018, 6, 30, 17, 14),
    datetime.datetime(2018, 6, 30, 17, 21),
    datetime.datetime(2018, 6, 30, 17, 25),
    datetime.datetime(2018, 6, 30, 17, 50)
]

goal_tweet = []
for goal in np.array(goal_time, dtype='datetime64[ns]'):
    for nb, time in zip(y, x):
        if abs((time - goal) / np.timedelta64(1, 's')) < 60:
            goal_tweet.append(nb)
            break
            
goal_team = ["Fr", "Arg", "Arg", "Fr", "Fr", "Fr", "Arg"]
goal_color = ["red" if team == "Fr" else "blue" for team in goal_team]
goal_player = ["Griezmann", "Di Maria", "Mercado", "Pavard", "Mbappe", "Mbappe", "Aguero"]
delta_x = [-0.020, -0.005, 0.010, 0.005, 0.005, 0.005, 0.01]
delta_y = [100,    -200,    -100,  -200,  -600,  -650,  0.050]

fault = datetime.datetime(2018, 6, 30, 16, 10)
for nb, time in zip(y, x):
    if abs((time - np.array(fault, dtype='datetime64[ns]')) / np.timedelta64(1, 's')) < 60:
        tweet_fault = nb
        break
In [16]:
fig = plt.figure(figsize=(20, 12))
ax = fig.add_subplot(111)

plt.plot(x, y)

plt.axvline(x=start_first)
plt.axvline(x=end_first)
plt.axvline(x=start_second)
plt.axvline(x=end_second)

plt.axvspan(start_first, end_first, alpha=0.3, color='green', label="first half")
plt.axvspan(start_second, end_second, alpha=0.3, color='orange', label="second half")

plt.scatter(goal_time, goal_tweet, c=goal_color)


for i in range(7):
    ax.annotate(goal_player[i], 
                xy=(md.date2num(goal_time[i]), goal_tweet[i]), 
                xytext=(md.date2num(goal_time[i])+delta_x[i], goal_tweet[i] + delta_y[i]),
                arrowprops=dict(facecolor=goal_color[i], shrink=0.05),
                color=goal_color[i],
                fontsize=20
            )

ax.annotate("Fault on Mbappe (Penalty)", 
            xy=(md.date2num(fault), tweet_fault), 
            xytext=(md.date2num(fault) - 0.001, tweet_fault -300),
            arrowprops=dict(facecolor="black", shrink=0.05),
            color="black",
            fontsize=20
    )


ax.xaxis.set_major_locator(dates.MinuteLocator(byminute=[0,15,30,45], interval = 1))
ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M'))

plt.ylabel("Number of Tweets", fontsize=15)
plt.title("Evolution of tweets posted during the match", fontsize=20)
plt.legend()
ax.grid(True)
plt.show()

A good way to find the start of each peak is to differentiate the curve and look for maxima. Once a peak is found, we can extract the surrounding period and see which tokens are the most frequent in it.

In [17]:
dy = []
dx = []
time_to_explore = []

for i in range(1, len(x)-1):
    dy.append((y[i+1] - y[i])/((x[i+1] - x[i]) / np.timedelta64(1, 's')))
    dx.append(x[i] + (x[i+1] - x[i])/2)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (20,12))

ax1.plot(dx, dy)
ax2.plot(x, y)

for i in range(1, len(dy)-1):
    if dy[i] > 5:
        if dy[i] - dy[i-1] > 1:
            ax1.scatter(dx[i], dy[i])
            ax2.scatter(dx[i], y[i+1])
            time_to_explore.append((dx[i], dx[i] + np.timedelta64(360, 's')))
            plt.axvspan(dx[i], dx[i] + np.timedelta64(360, 's'), alpha=0.3, color='green', label="periods to explore")

plt.legend()
plt.show()
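The manual finite difference above can also be written with vectorised NumPy operations; a sketch with toy per-minute counts standing in for the real `x` and `y` (the threshold here is illustrative, not the one used above):

```python
import numpy as np

# toy per-minute tweet counts standing in for y
y = np.array([10.0, 12.0, 50.0, 55.0, 53.0])
dt = 60.0  # seconds between consecutive minutes

dy = np.diff(y) / dt            # tweets per second between samples
rising = np.diff(dy) > 0        # derivative still increasing
mask = (dy[1:] > 0.5) & rising  # above threshold AND rising = start of a peak
starts = np.where(mask)[0] + 1  # indices into dy
print(starts)                   # [1]
```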

So now we have all the time frames to check. We can simply count every token for each period and display the result as a word cloud too.

In [18]:
plt.figure(figsize=(20, 25))

for i, portion in enumerate(time_to_explore):
    sub_df = df[ (portion[0] < df.Time) & (df.Time <= portion[1]) ]
    results_time = Counter()
    sub_df['tokens'].str.split("-").apply(results_time.update)

    for word in stopWords + ["argentina", "franc", "game"]:
        if word in results_time:
            del results_time[word]
    
    wordcloud = WordCloud().generate_from_frequencies(results_time)

    plt.subplot(5, 2, i+1)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.title("from {:%H:%M} to {:%H:%M}".format(datetime.datetime.fromtimestamp(portion[0].astype(datetime.datetime)/1e9), 
                                                 datetime.datetime.fromtimestamp(portion[1].astype(datetime.datetime)/1e9)), fontsize=20)
    plt.axis("off")
plt.show()

Player ranking

Now that we have a counter of words, we can find all the ways each player is named. The first step is to extract from every tweet the names of all players present in it.

In [19]:
def extract_name(x):
    to_add = []
    tokens = set(x.split("-"))
    for name, grammar in france_player.items():
        if len( set(grammar).intersection(tokens) ) > 0:
            to_add.append(name)
    for name, grammar in argentina_player.items():
        if len( set(grammar).intersection(tokens) ) > 0:
            to_add.append(name)
    return "-".join(to_add)

argentina_player = {
    "Messi" : ["messi", "lionel", "mess", "leo"],
    "Dybala" : ["dybala"],
    "Aguero" : ["aguero", "sergio", "agüero"],
    "Higuain" : ["higuain"],
    "Di Maria" : ["di", "maria", "dimaria", "maría", "mariaa"],
    "Mascherano" : ["mascherano"],
    "Caballero" : ["caballero"],
    "Meza"  : ["meza"],
    "Pavon" : ["pavon"],
    "Armani" : ["armani"],
    "Otamendi" : ["otamendi"],
    "Rojo" : ["rojo", "marco"],
    "Perez" : ["perez"],
    "Salvio" : ["salvio"],
    "Banega" : ["benega"],
    "Biglia" : ["biglia"],
    "Acuna" : ['acuna'],
    "Tagliafico" : ["tagliafico"],
    "Lo Celso" : ["lo", "celso", "lo celso", "locelso"],
    "Guzman" : ["guzman"],
    "Mercado" : ["mercado", "gabriel"],
    "Fazio" : ["fazio"],
    "Ansaldi" : ["ansaldi"]
}

france_player = {
    "Griezmann" : ["griezmann", "antoin", "griezman"],
    "Pogba" : ["pogba", "paul"],
    "Giroud" : ["giroud"],
    "Mbappe" : ["mbapp", "kylian", "mbappé", "mbappe", "mbape"],
    "Lloris" : ["llori", "hugo", "lloris"],
    "Dembele" : ["dembel", "dembele"],
    "Fekir" : ["fekir"],
    "Pavard" : ["pavard", "benjamin"],
    "Kante" : ["kant", "kante"],
    "Matuidi" : ["matuidi"],
    "Hernandez" : ["hernandez"],
    "Varane" : ["varan", "varane"],
    "Umtiti" : ["umtiti", "samuel"],
    "Rami" : ["rami"],
    "Thauvin" : ["thauvin", "florian"],
    "Tolisso" : ["tolisso"],
    "Mandanda" : ["mandanda"],
    "Kimpembe" : ["kimpemb", "kimpembe"],
    "Lemar" : ["lemar"],
    "Mendy" : ["mendy"],
    "Areola" : ["areola"],
    "Sidibe" : ["sidib", "sidibe"],
    "Nzonzi" : ["nzonzi"]
}

X = df["tokens"].apply(extract_name)

Now we have a Series containing, for every tweet, the name(s) mentioned. We can convert it to a one-hot encoded matrix.

In [20]:
X.head()
Out[20]:
ID
1013053260956098561         
1013053268950315008         
1013053270313619456    Messi
1013053272758759425    Messi
1013053274872799232         
Name: tokens, dtype: object
In [21]:
df_player = X.str.get_dummies("-")
df_player.head()
Out[21]:
Acuna Aguero Areola Armani Banega Biglia Caballero Dembele Di Maria Dybala ... Perez Pogba Rami Rojo Sidibe Tagliafico Thauvin Tolisso Umtiti Varane
ID
1013053260956098561 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1013053268950315008 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1013053270313619456 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1013053272758759425 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1013053274872799232 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 43 columns
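`str.get_dummies` splits each string on the separator and one-hot encodes the resulting labels; a tiny self-contained example of the behaviour relied on here:

```python
import pandas as pd

# each string holds the players mentioned in one tweet, "-"-separated
s = pd.Series(["Messi-Mbappe", "Messi", ""])
dummies = s.str.get_dummies("-")
print(dummies)
```

Tweets that mention no player (the empty string) simply get a row of zeros, which is why the matrix can be summed directly to rank players.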

We can now extract how often each player is mentioned and build a ranking.

In [22]:
df_player.sum(axis=0)
Out[22]:
Acuna             5
Aguero         1581
Areola            4
Armani          200
Banega            8
Biglia           11
Caballero        33
Dembele         103
Di Maria       5158
Dybala         1216
Fazio           206
Fekir            62
Giroud          980
Griezmann      1269
Guzman            1
Hernandez        54
Higuain         510
Kante           583
Kimpembe          2
Lemar            17
Lloris          355
Lo Celso         43
Mandanda          2
Mascherano      548
Matuidi         241
Mbappe        12463
Mercado         566
Messi         10836
Meza            238
Nzonzi            8
Otamendi        513
Pavard         2641
Pavon           268
Perez           129
Pogba          1637
Rami              1
Rojo           1154
Sidibe            9
Tagliafico       58
Thauvin         126
Tolisso          33
Umtiti          160
Varane           50
dtype: int64
In [23]:
score = Counter(dict(df_player.sum(axis=0)))

score_fr = Counter({key : value for key, value in score.items() if key in france_player})
score_arg = Counter({key : value for key, value in score.items() if key in argentina_player})

print("Most Frequent player from Arg")
print(score_arg.most_common(11))

print("\nMost Frequent player from Fr")
print(score_fr.most_common(10))
Most Frequent player from Arg
[('Messi', 10836), ('Di Maria', 5158), ('Aguero', 1581), ('Dybala', 1216), ('Rojo', 1154), ('Mercado', 566), ('Mascherano', 548), ('Otamendi', 513), ('Higuain', 510), ('Pavon', 268), ('Meza', 238)]

Most Frequent player from Fr
[('Mbappe', 12463), ('Pavard', 2641), ('Pogba', 1637), ('Griezmann', 1269), ('Giroud', 980), ('Kante', 583), ('Lloris', 355), ('Matuidi', 241), ('Umtiti', 160), ('Thauvin', 126)]
In [24]:
x_arg, x_fr, y_arg, y_fr = [], [], [], []
name = []
for i, (player, num) in enumerate(score.most_common(20)):
    if player in france_player.keys():
        name.append(player)
        x_fr.append(i)
        y_fr.append(num)
    else:
        name.append(player)
        x_arg.append(i)
        y_arg.append(num)

fig, ax = plt.subplots(1, 1, figsize=(20,12))

p1 = ax.bar(x_fr, y_fr, color ="red")
p2 = ax.bar(x_arg, y_arg, color="blue")

ax.yaxis.set_tick_params(labelsize=12)

locs, labels = plt.xticks()
plt.xticks(range(20), name, rotation=90, fontsize=12)

plt.xlim(-0.5, 19.5)
plt.ylabel("Number of tweets", fontsize=15)
plt.title("Ranking of Players by number of tweets", fontsize=20)

ax.legend((p1[0], p2[0]), ('France', 'Argentina'), fontsize=15)

plt.show()

Instead of looking at individual players, we can aggregate by team.

In [25]:
labels = 'Argentina', 'France'
sizes = [sum(score_arg.values()), sum(score_fr.values())]
colors = ['blue', 'red']
explode = (0.05, 0.05)

fig = plt.figure(figsize=(8,8))
patches, texts , autotxt = plt.pie(sizes, labels=labels, 
        colors=colors, 
        explode=explode,
        autopct='%1.1f%%', 
        shadow=True, 
        startangle=90)
for autotext in autotxt:
    autotext.set_color('white')
    autotext.set_fontsize(15)
    
for text in texts:
    text.set_fontsize(15)
    
plt.title("Number of tweets with \n a player name", fontsize=20)
plt.axis('equal')
plt.show()

We can see that the most mentioned player is Mbappe from France, but overall there are more tweets about Argentina's players.

Sentiments and Players

To go into more detail per player, we also have each tweet's sentiment and timestamp. We can therefore look at several additional points:

  • The times when each player is most mentioned
  • The average sentiment for each player
  • The distribution of sentiments

To do so, we will have to group the DataFrame by time.

In [26]:
df_player_plus = df_player.join(df[["Time", "Sentiment"]])

agg_option = { x : ["sum", "mean"] for x in df_player_plus if x not in ["Time", "Sentiment"]}
agg_option["Sentiment"] = ["min", "max", "mean"]

agg_player = df_player_plus.groupby([df_player_plus.Time.dt.hour, df_player_plus.Time.dt.minute]).agg(agg_option)

agg_player.head()
Out[26]:
Acuna Aguero Areola Armani Banega ... Thauvin Tolisso Umtiti Varane Sentiment
sum mean sum mean sum mean sum mean sum mean ... mean sum mean sum mean sum mean min max mean
Time Time
15 34 0 0.0 0 0.000000 0 0.0 0 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.0 0.466349 0.855800 0.705582
35 0 0.0 6 0.093750 0 0.0 0 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.0 0.060213 0.982866 0.650325
36 0 0.0 2 0.028169 0 0.0 0 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.0 0.049918 0.963130 0.627107
37 0 0.0 4 0.061538 0 0.0 0 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.0 0.003206 0.950448 0.627897
38 0 0.0 2 0.025641 0 0.0 0 0.0 0 0.0 ... 0.0 0 0.0 0 0.0 0 0.0 0.032436 0.978462 0.616958

5 rows × 89 columns

For readability, we will look at those stats only for the top 6 players (Mbappe, Messi, Di Maria, Pavard, Pogba, Aguero).

We will plot below:

  • The number of tweets mentioning each player, per minute
  • The cumulative sum over the match
  • The percentage of tweets per minute containing the name (more meaningful than the raw count, since the overall tweet rate is not constant)
In [27]:
X = agg["min"]

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(20,36))
for col in name[:6]:
    ax1.plot(X, agg_player[(col, "sum")], label = col)
    ax2.plot(X, np.cumsum(agg_player[(col, "sum")]), label = col)
    ax3.plot(X, 100*agg_player[(col, "mean")], label = col)

for ax in [ax1, ax2, ax3]:
    ax.xaxis.set_major_locator(dates.MinuteLocator(byminute=[0,15,30,45], interval = 1))
    ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M'))
    ax.legend(loc=2, fontsize=12)
    
ax1.set_ylabel("Number of tweets", fontsize=12)
ax2.set_ylabel("Number of tweets", fontsize=12)
ax3.set_ylabel("Percent of tweets", fontsize=12)

ax1.set_title("Number of tweets per minutes", fontsize=20)
ax2.set_title("Cumulative sum of tweets per minutes", fontsize=20)
ax3.set_title("Average of tweets having the name", fontsize=20)

yticks = mtick.FormatStrFormatter('%.0f%%')
ax3.yaxis.set_major_formatter(yticks)

plt.show()

We can see that Mbappe is nearly always at the top in terms of number of mentions, but in relative terms Di Maria holds the record of tweets per minute, just after his goal: his name appeared in 45% of tweets during that window. We can also see a peak when Pavard scored, with around 23% of tweets containing his name.

Sentiment

For this part, we will have to reshape the DataFrame again so that we have one column for sentiment and one for player. Then we can compare them.

In [28]:
temp = df_player_plus[name[:10] + ["Sentiment"]]
temp = pd.melt(temp, id_vars="Sentiment", value_vars=name[:10])
temp = temp[temp.value != 0 ]
temp.head()
Out[28]:
Sentiment variable value
270 0.656446 Mbappe 1
321 0.674988 Mbappe 1
582 0.740536 Mbappe 1
608 0.755060 Mbappe 1
660 0.820606 Mbappe 1

With a jitter plot, we can see the balance of sentiment as well as the difference in number of mentions (1 dot = 1 tweet mentioning the player).

In [29]:
plt.figure(figsize=(20, 12))
sns.stripplot('variable', 'Sentiment', data=temp, jitter=0.5, alpha = 0.6)
sns.despine()
plt.show()

Nevertheless, this is not very easy to read. To see the trend, we should look at the distribution: the further to the right it sits, the more positive the sentiment.

In [30]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20,20))
for player in name[:10]:
    sns.distplot(temp[temp.variable == player]["Sentiment"], hist=False, rug=False, label=player, ax=ax1)
    sns.distplot(temp[temp.variable == player]["Sentiment"], hist=False, rug=False, label=player, hist_kws={'cumulative': True}, kde_kws={'cumulative': True}, ax=ax2)

ax1.set_title("Distribution of Sentiments", fontsize=20)
ax2.set_title("Cumulative distribution of Sentiments", fontsize=20)
plt.xlim(0, 1)
plt.legend(loc=2)
plt.show()

The player with the best average sentiment is the one whose curve is shifted furthest to the right; in the cumulative plot, this corresponds to the lowest curve. As a result, we can say:

  • 1st : Pavard
  • 2nd : Di Maria
  • 3rd : Mbappe
  • 4th/5th : Pogba and Griezmann
  • 6th : Giroud
  • 7th : Rojo
  • 8th : Messi (perhaps due to a poor performance during the match, even though he is the 2nd most mentioned)
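The visual ranking can be cross-checked numerically by averaging sentiment per player on the melted frame; a sketch with hypothetical data in the same shape as `temp`:

```python
import pandas as pd

# hypothetical melted frame: one row per (tweet, mentioned player)
temp = pd.DataFrame({
    "Sentiment": [0.9, 0.8, 0.6, 0.5, 0.7],
    "variable":  ["Pavard", "Pavard", "Messi", "Messi", "Mbappe"],
})

# mean sentiment per player, best first
ranking = temp.groupby("variable")["Sentiment"].mean().sort_values(ascending=False)
print(ranking)
```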

Conclusion

In this project, we used a few NLP tools (mainly during preprocessing) to get a more in-depth view of Twitter opinion about players. We saw that the most mentioned players are not always the ones with the best evaluations. We also looked at the evolution of tweet volume during the match, which could perhaps be used to train a model that detects when a goal is scored.